Functional Linear Regression - First Attempt


This document comprises the first attempt to fit a multivariate functional linear regression (mFLR) model, including performing FPCAs for functional variables The available covariates can be classified into four different types:

The first category is not modeled in this approach since there is need for further thinking. The PFT-dependent variables are transformed to PFT-wise covariates to match the general data shape (1803 observations, i.e., disturbed grid cells). For every disturbed grid cell, the soil related variables are already provided, so no further pre-processing is necessary.

The time-varying, location dependent variables, i.e., the climate covariates and total nitrogen uptake are functional as the response. This implies that these covariates need to be transformed into multivariate data. In particular, in this first approach, the two climate variables \(tas\_yearlymean\) and \(pr\_yearlysum\) are represented as PC scores in the final model formulation, which result from two univariate FPCAs. Details on that are derived in the following.

FPCA for Climate Covariates

Functional Fits

The first step in functional data analysis is to find an appropriate representation for the functional data. Figure 1 shows the functional fit for the average annual temperature (\(tas\_yearlymean\)). Note that there is no basis representation or smoother involved here. The black line indicates the mean curve. According to the curves, there is a slight trend towards warmer temperatures in the later decades of the study period. But this result is less meaningful considering the curves comprise all four scenarios.


Figure 1: Functional fit for average annual temperature.
Figure 1: Functional fit for average annual temperature.

Figure 2 shows the functional fit for the annual average temperature for each scenario. As expected, with rising radial forcing, the annual mean temperature increases.


Figure 2: Functional fit for average annual temperature per scenario.
Figure 2: Functional fit for average annual temperature per scenario.

Figure 3 visualizes the annual sum of precipitation as function for each disturbed grid cell. The overall mean function indicates no trend in time across the study period.


Figure 3: Functional fit for annual precipitation
Figure 3: Functional fit for annual precipitation

As before, Figure 4 shows the scenario-wise functional fits of precipitation. Interestingly, one may hypothesize, that the more radial forcing, the less variation in precipitation.


Figure 4: Functional fit for annual precipitation per scenario.
Figure 4: Functional fit for annual precipitation per scenario.

In general, all plots indicated that most of the variation in the data is captured by horizontal lines. This is proved by considering the principal components in the next section.


Principal Components

Performing a FPCA or MFPCA using the R package includes steps within the functions which are not obvious to the user. For instance, the basis for the basis representation is set automatically, and the functional fits cannot be further explored. There is need for diving deeper into the functionality of this package to fully understand the basis representation used for running the FPCA or MFPCA.

Figure 5 and 6 show the first two principal components for both annual mean temperature and annual precipitation, respectively. As already indicated in the figures above, nearly all the variance can be captured by \(y=0\).


Figure 5: First and second principal component of covariate “Annual mean temperature”.
Figure 5: First and second principal component of covariate “Annual mean temperature”.

Figure 6: First and second principal component of covariate “Annual precipitation”.
Figure 6: First and second principal component of covariate “Annual precipitation”.

Note that nearly all the variance in the data is already captured by the first PC for both variables.


Questions here: What do I have to look at when I want to see that PC 1 and PC 2 are orthogonal to each other?


Reconstructions

The performed FPCAs serve as dimensionality reducer and smoothing operator. This becomes clear when looking at the reconstructed curves using the first 10 PCs visualized in Figure 7.


Figure 7: Reconstructed curves using 10 PCs for both covariates “Annual mean temperature” and “Annual precipitation”.
Figure 7: Reconstructed curves using 10 PCs for both covariates “Annual mean temperature” and “Annual precipitation”.

With these results, all provided covariates are prepared for modeling in the next step.


Multivariate Functional Linear Regression (mFLR) model

Combining all covariates together results in the following model formula:


\(\begin{pmatrix}\text{PC1} \\\text{PC2}\end{pmatrix}= \ \begin{pmatrix}\beta_{0,1} \\\beta_{0,2}\end{pmatrix}+\begin{pmatrix}\beta_{1,1} \\ \beta_{1,2}\end{pmatrix} \cdot \text{Lon}+\begin{pmatrix}\beta_{2,1} \\ \beta_{2,2}\end{pmatrix} \cdot \text{Lat}+\begin{pmatrix}\beta_{3,1} \\ \beta_{3,2}\end{pmatrix} \cdot \text{sand_fraction} \\ +\begin{pmatrix}\beta_{4,1} \\ \beta_{4,2}\end{pmatrix} \cdot\text{silt_fraction}+\begin{pmatrix}\beta_{5,1} \\ \beta_{5,2}\end{pmatrix} \cdot \text{bulkdensity_soil}+\begin{pmatrix}\beta_{6,1} \\ \beta_{6,2}\end{pmatrix} \cdot \text{ph_soil} \\ +\begin{pmatrix}\beta_{7,1} \\ \beta_{7,2}\end{pmatrix} \cdot \text{soilcarbon}+\begin{pmatrix}\beta_{8,1} \\ \beta_{8,2}\end{pmatrix} \cdot \text{time_since_dist}\\+\begin{pmatrix}\beta_{9,1} \\ \beta_{9,2}\end{pmatrix} \cdot\text{Scenario}_{\text{SSP1-RCP2.6}} + \begin{pmatrix}\beta_{10,1} \\ \beta_{10,2}\end{pmatrix} \cdot\text{Scenario}_{\text{SSP3-RCP7.0}}+ \begin{pmatrix}\beta_{11,1} \\ \beta_{11,2}\end{pmatrix} \cdot\text{Scenario}_{\text{SSP5-RCP8.5}} \\+\begin{pmatrix}\beta_{12,1} \\ \beta_{12,2}\end{pmatrix} \cdot \text{initial_recruitment_BNE}+\begin{pmatrix}\beta_{13,1} \\ \beta_{13,2}\end{pmatrix} \cdot\text{initial_recruitment_IBS}+\begin{pmatrix}\beta_{14,1} \\ \beta_{14,2}\end{pmatrix} \cdot\text{initial_recruitment_TeBS} \\ +\begin{pmatrix}\beta_{15,1} \\ \beta_{15,2}\end{pmatrix} \cdot\text{initial_recruitment_Tundra}+\begin{pmatrix}\beta_{16,1} \\ \beta_{16,2}\end{pmatrix} \cdot\text{initial_recruitment_otherC}\\+\begin{pmatrix}\beta_{17,1} \\ \beta_{17,2}\end{pmatrix} \cdot\text{recruitment_ten_years_BNE} +\begin{pmatrix}\beta_{18,1} \\ \beta_{18,2}\end{pmatrix} \cdot\text{recruitment_ten_years_IBS}+\begin{pmatrix}\beta_{19,1} \\ \beta_{19,2}\end{pmatrix} \cdot\text{recruitment_ten_years_TeBS}\\+\begin{pmatrix}\beta_{20,1} \\ \beta_{20,2}\end{pmatrix} \cdot\text{recruitment_ten_years_Tundra} +\begin{pmatrix}\beta_{21,1} \\ \beta_{21,2}\end{pmatrix} \cdot\text{recruitment_ten_years_otherC}\\+\begin{pmatrix}\beta_{22,1} \\ \beta_{22,2}\end{pmatrix} \cdot\text{previous_state_BNE}+\begin{pmatrix}\beta_{23,1} \\ \beta_{23,2}\end{pmatrix} \cdot \text{previous_state_IBS} +\begin{pmatrix}\beta_{24,1} \\ \beta_{24,2}\end{pmatrix} \cdot \text{previous_state_TeBS}\\+\begin{pmatrix}\beta_{25,1} \\ \beta_{25,2}\end{pmatrix} \cdot \text{previous_state_Tundra}+\begin{pmatrix}\beta_{26,1} \\ \beta_{26,2}\end{pmatrix} \cdot \text{previous_state_otherC}\\ +\begin{pmatrix}\beta_{27,1} \\ \beta_{27,2}\end{pmatrix} \cdot \text{PC1_temp}+\begin{pmatrix}\beta_{28,1} \\ \beta_{28,2}\end{pmatrix} \cdot \text{PC1_precip}+\begin{pmatrix}\epsilon_1\\\epsilon_2\end{pmatrix}\)


Since the coil composition adds up to 1, one of the three composition is left out (\(clay\_fraction\)). Note that so far only linear effects are considered. All variables but \(tas\_yearlymin\), \(tas\_yearlymax\) and both nitrogen uptake variables (\(Nuptake\) and \(Nuptake\_total\)) are considered here.

The model summary including the effect estimates are given here:


## Response PC1 :
## 
## Call:
## lm(formula = PC1 ~ Lon + Lat + sand_fraction + silt_fraction + 
##     bulkdensity_soil + ph_soil + soilcarbon + time_since_dist + 
##     Scenario + initial_recruitment_BNE + initial_recruitment_IBS + 
##     initial_recruitment_TeBS + initial_recruitment_Tundra + initial_recruitment_otherC + 
##     recruitment_ten_years_BNE + recruitment_ten_years_IBS + recruitment_ten_years_TeBS + 
##     recruitment_ten_years_Tundra + recruitment_ten_years_otherC + 
##     previous_state_BNE + previous_state_IBS + previous_state_TeBS + 
##     previous_state_Tundra + previous_state_otherC + PC1_temp + 
##     PC1_precip, data = d_train %>% select(-clay_fraction))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6860  -1.9649  -0.0065   1.9652   7.4668 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.199e+01  1.962e+00 -11.209  < 2e-16 ***
## Lon                          -6.785e-03  8.459e-04  -8.021 2.18e-15 ***
## Lat                           2.605e-01  1.630e-02  15.980  < 2e-16 ***
## sand_fraction                 2.192e+01  1.223e+00  17.913  < 2e-16 ***
## silt_fraction                 1.410e+01  2.113e+00   6.671 3.64e-11 ***
## bulkdensity_soil             -3.708e+00  7.292e-01  -5.085 4.16e-07 ***
## ph_soil                      -3.083e-01  1.710e-01  -1.803  0.07155 .  
## soilcarbon                   -4.044e-01  6.431e-02  -6.287 4.29e-10 ***
## time_since_dist               4.367e-04  5.782e-04   0.755  0.45029    
## ScenarioSSP1-RCP2.6          -1.794e+00  2.126e-01  -8.439  < 2e-16 ***
## ScenarioSSP3-RCP7.0          -1.308e+00  2.409e-01  -5.430 6.62e-08 ***
## ScenarioSSP5-RCP8.5          -1.343e+00  2.590e-01  -5.183 2.50e-07 ***
## initial_recruitment_BNE       1.083e-02  2.404e-03   4.506 7.14e-06 ***
## initial_recruitment_IBS       2.792e-03  4.060e-03   0.688  0.49175    
## initial_recruitment_TeBS      4.895e-02  1.832e-02   2.672  0.00763 ** 
## initial_recruitment_Tundra   -9.698e-03  3.652e-03  -2.655  0.00801 ** 
## initial_recruitment_otherC   -9.606e-03  4.702e-03  -2.043  0.04123 *  
## recruitment_ten_years_BNE    -3.910e-04  1.730e-04  -2.260  0.02398 *  
## recruitment_ten_years_IBS     1.259e-03  6.809e-04   1.850  0.06458 .  
## recruitment_ten_years_TeBS   -6.397e-03  3.263e-03  -1.960  0.05014 .  
## recruitment_ten_years_Tundra  3.390e-04  2.358e-04   1.438  0.15074    
## recruitment_ten_years_otherC  6.479e-04  1.414e-03   0.458  0.64684    
## previous_state_BNE            4.751e-03  2.538e-02   0.187  0.85152    
## previous_state_IBS           -7.149e-03  1.818e-02  -0.393  0.69414    
## previous_state_TeBS           1.656e-01  1.141e-01   1.451  0.14689    
## previous_state_Tundra         5.425e-03  1.529e-01   0.035  0.97171    
## previous_state_otherC        -1.832e-01  3.789e-02  -4.834 1.48e-06 ***
## PC1_temp                      6.787e-02  3.770e-03  18.002  < 2e-16 ***
## PC1_precip                   -6.060e-04  3.770e-05 -16.074  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.632 on 1414 degrees of freedom
## Multiple R-squared:  0.6317, Adjusted R-squared:  0.6244 
## F-statistic: 86.61 on 28 and 1414 DF,  p-value: < 2.2e-16
## 
## 
## Response PC2 :
## 
## Call:
## lm(formula = PC2 ~ Lon + Lat + sand_fraction + silt_fraction + 
##     bulkdensity_soil + ph_soil + soilcarbon + time_since_dist + 
##     Scenario + initial_recruitment_BNE + initial_recruitment_IBS + 
##     initial_recruitment_TeBS + initial_recruitment_Tundra + initial_recruitment_otherC + 
##     recruitment_ten_years_BNE + recruitment_ten_years_IBS + recruitment_ten_years_TeBS + 
##     recruitment_ten_years_Tundra + recruitment_ten_years_otherC + 
##     previous_state_BNE + previous_state_IBS + previous_state_TeBS + 
##     previous_state_Tundra + previous_state_otherC + PC1_temp + 
##     PC1_precip, data = d_train %>% select(-clay_fraction))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1257 -0.2266  0.2996  0.7167  8.0351 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   2.101e+00  1.093e+00   1.923 0.054667 .  
## Lon                          -4.370e-05  4.711e-04  -0.093 0.926101    
## Lat                          -3.391e-02  9.081e-03  -3.734 0.000196 ***
## sand_fraction                -3.264e-01  6.814e-01  -0.479 0.632003    
## silt_fraction                -2.079e+00  1.177e+00  -1.766 0.077555 .  
## bulkdensity_soil              3.047e-01  4.061e-01   0.750 0.453218    
## ph_soil                       3.198e-02  9.523e-02   0.336 0.737043    
## soilcarbon                    3.124e-02  3.582e-02   0.872 0.383254    
## time_since_dist              -2.938e-04  3.220e-04  -0.912 0.361725    
## ScenarioSSP1-RCP2.6           4.812e-01  1.184e-01   4.064 5.09e-05 ***
## ScenarioSSP3-RCP7.0           7.720e-01  1.342e-01   5.753 1.07e-08 ***
## ScenarioSSP5-RCP8.5           7.207e-01  1.443e-01   4.995 6.60e-07 ***
## initial_recruitment_BNE       3.850e-04  1.339e-03   0.288 0.773740    
## initial_recruitment_IBS      -8.364e-03  2.261e-03  -3.699 0.000225 ***
## initial_recruitment_TeBS     -1.302e-01  1.020e-02 -12.762  < 2e-16 ***
## initial_recruitment_Tundra    2.016e-03  2.034e-03   0.991 0.321727    
## initial_recruitment_otherC   -3.387e-03  2.618e-03  -1.293 0.196055    
## recruitment_ten_years_BNE     7.771e-05  9.637e-05   0.806 0.420173    
## recruitment_ten_years_IBS     1.375e-03  3.792e-04   3.625 0.000299 ***
## recruitment_ten_years_TeBS    1.821e-02  1.817e-03  10.018  < 2e-16 ***
## recruitment_ten_years_Tundra -2.332e-04  1.313e-04  -1.776 0.075953 .  
## recruitment_ten_years_otherC  1.117e-04  7.874e-04   0.142 0.887213    
## previous_state_BNE           -3.093e-02  1.413e-02  -2.188 0.028800 *  
## previous_state_IBS           -3.948e-03  1.012e-02  -0.390 0.696581    
## previous_state_TeBS          -3.350e-01  6.354e-02  -5.272 1.56e-07 ***
## previous_state_Tundra        -2.055e-01  8.518e-02  -2.412 0.015987 *  
## previous_state_otherC        -9.794e-02  2.110e-02  -4.642 3.78e-06 ***
## PC1_temp                      2.264e-02  2.100e-03  10.782  < 2e-16 ***
## PC1_precip                    3.524e-05  2.100e-05   1.678 0.093484 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 1414 degrees of freedom
## Multiple R-squared:  0.3486, Adjusted R-squared:  0.3357 
## F-statistic: 27.03 on 28 and 1414 DF,  p-value: < 2.2e-16

For a valid interpretation, the first two PCs of the target are depicted at the end of this document.

Questions here: Is there any possibility to trace back the estimates on single PFTs? For instance, in PC1 we can see, the more extreme the scenario, the larger the \(scenario\) estimate. Looking at the PCs at the bottom indicates more BNE, less IBS etc. But what happens for the single PFTs? E.g. I would suggest that the higher the temperature, the more TeBS, but how could I see that in the model? Is it even possible?

Next, the model is evaluated. Figure 6 shows residual and Q-Q plots for both PCs.


Figure 8: Model residuals and Q-Q plots for both predicted PCs (on train data set).
Figure 8: Model residuals and Q-Q plots for both predicted PCs (on train data set).

Prior to running the model, the data was split into test and train data sets. In order to evaluate, how good the model performs on unseen data, Figure 9 shows the true PC scores vs. the predicted PC scores for the test set.


Figure 9: True vs. fitted PC scores for unseen data.
Figure 9: True vs. fitted PC scores for unseen data.

The basis functions estimated by the MFPCA for the target can now be used to transform the predicted PC scores of the unseen data into functional data. As an example, Figure shows for four randomly picked disturbed grid cells, one for each scenario, the true functional fit and the predicted one from the model.


Figure 10: True functional fit and predicted functional fit for four different example grid cells, one for each scenario.
Figure 10: True functional fit and predicted functional fit for four different example grid cells, one for each scenario.

To conclude, this first attempt seems rather promising, but there is need for improvement.

Possible adaptations:

  • MFPCA for all three climate variables
  • FPCA for total nitrogen uptake and/or include PFT-wise nitrogen uptake (as MFPCA over all 5 PFTs)
  • Check for measurement error (Berkson error?)
  • Check for non-linear effects
  • Check for correlations within covariates
  • Variable selection (AIC/BIC, forward/backward/stepwise selection)

First two PCs of target MFPCA for interpretation


Figure 11: First PC of target variable: BNE, IBS, otherC, TeBS, Tundra.
Figure 11: First PC of target variable: BNE, IBS, otherC, TeBS, Tundra.

Figure 12: Second PC of target variable: BNE, IBS, otherC, TeBS, Tundra.
Figure 12: Second PC of target variable: BNE, IBS, otherC, TeBS, Tundra.